
Record: 1.1194 BPB — v9 Batched Muon + Full GPTQ Random Calib + JEPA Research#1124

Open
NewyorkDev wants to merge 2 commits into openai:main from NewyorkDev:submission/v9-batched-muon-1.1194

Conversation

@NewyorkDev

Summary

  • val_bpb: 1.1194 (3-seed mean, std 0.0002) | 15.90 MB max artifact | 8xH100 SXM, 600s
  • Key innovation: Batched Newton-Schulz orthogonalization via torch.bmm — groups 66 weight matrices into 4 shape-matched batches, 5% optimizer speedup, ~400 extra training steps
  • Full GPTQ with random token calibration — compliant Hessian collection without training data access
  • 20+ hours of JEPA/STP research — 14 ablation tests proving auxiliary losses hurt at this scale, documented with full tables
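
The batching idea above can be sketched in a few lines: group same-shape gradient matrices into one tensor, then run every Newton-Schulz iteration as a single batched matmul instead of one matmul per matrix. This is a minimal NumPy sketch (np.matmul broadcasting plays the role of torch.bmm); the quintic coefficients are the standard Muon values and the shapes are illustrative assumptions, not the submission's actual code:

```python
import numpy as np

def batched_newton_schulz(G, steps=5):
    """Approximately orthogonalize a batch of same-shape matrices at once.
    np.matmul broadcasts over the leading batch dim, the NumPy analogue of
    the torch.bmm batching described above. Coefficients are the standard
    Muon quintic (an assumption; the PR's exact values may differ)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G, axis=(1, 2), keepdims=True) + 1e-7)
    flipped = X.shape[1] > X.shape[2]
    if flipped:                          # work on the short side
        X = X.transpose(0, 2, 1)
    for _ in range(steps):
        A = X @ X.transpose(0, 2, 1)     # one batched matmul for the whole group
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.transpose(0, 2, 1) if flipped else X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 16, 32))     # 4 shape-matched gradients per batch
O = batched_newton_schulz(G)
```

After five steps the singular values of each matrix in the batch cluster near 1, which is all the Muon update needs; the speedup comes purely from launching 4 batched matmuls instead of 66 individual ones per term.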

Results (3 seeds)

Seed   Sliding BPB   Artifact
1337   1.1191        15.90 MB
42     1.1195        15.98 MB
7      1.1195        15.90 MB
Mean   1.1194        -

Research Contributions

  • JEPA negative result at 27M scale: 14 controlled ablations showing JEPA (Joint-Embedding Predictive Architecture) hurts training at 600s/27M params. Found and fixed gradient interference bug (67% penalty reduction), but still net negative.
  • STP negative result: Tested LeCun lab's Semantic Tube Prediction (arXiv 2602.22617) — zero-param JEPA variant. Also negative at this scale.
  • Label smoothing eval bug: Discovered and fixed a subtle bug where label smoothing was applied during eval via model.forward(), contaminating BPB measurements.
  • TTT on XSA-all: Confirmed the finding of PR #1019 (Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112, val_bpb 1.11473, 3-seed mean) that score-first TTT is ineffective when XSA covers all layers.
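
The label-smoothing eval bug above boils down to a one-line gate: the smoothing epsilon must be zeroed outside training, or every BPB measurement is shifted even though the model is unchanged. A minimal NumPy sketch of the failure mode and the fix (illustrative only; not the submission's loss code):

```python
import numpy as np

def token_ce(logits, targets, label_smoothing=0.0, training=True):
    """Mean cross-entropy in nats. The gate on `training` is the fix the PR
    describes: if smoothing leaks into eval, measured loss (and hence BPB)
    is contaminated."""
    eps = label_smoothing if training else 0.0   # the fix: no smoothing at eval
    z = logits - logits.max(axis=-1, keepdims=True)           # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    n = logits.shape[0]
    nll = -logp[np.arange(n), targets]            # standard CE term
    uniform = -logp.mean(axis=-1)                 # smoothing target term
    return float(((1 - eps) * nll + eps * uniform).mean())

rng = np.random.default_rng(42)
logits = rng.standard_normal((128, 1024))
targets = rng.integers(0, 1024, size=128)
logits[np.arange(128), targets] += 3.0   # give the toy "model" some signal

clean = token_ce(logits, targets, label_smoothing=0.1, training=False)
buggy = token_ce(logits, targets, label_smoothing=0.1, training=True)
```

With smoothing leaking into eval, `buggy` sits well above `clean` for the same model; converting nats to bits-per-byte inherits the same offset, which is why the bug silently inflated BPB.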

Compliance

  • 3 seeds, all <= 600s training
  • All artifacts <= 16,000,000 bytes
  • No training data access during quantization (random calibration tokens)
  • No TTT on validation data
  • No network calls during evaluation
  • Single file train_gpt.py
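
The quantization-compliance point works because GPTQ never needs labels or real text, only the second moment of each layer's inputs, so random token ids suffice to drive the Hessian collection. A hypothetical NumPy sketch of that collection step, under assumed sizes and a stand-in embedding (the 2/N scaling and diagonal dampening follow standard GPTQ practice; none of this is the submission's actual code):

```python
import numpy as np

def gptq_hessian(X, damp=0.01):
    """H = 2/N * X^T X plus diagonal dampening: the per-layer Hessian proxy
    GPTQ needs. X holds the layer's calibration inputs, here produced from
    *random* token ids rather than training data (the compliance point)."""
    X2 = X.reshape(-1, X.shape[-1])
    H = 2.0 * (X2.T @ X2) / X2.shape[0]
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # keep H invertible
    return H

rng = np.random.default_rng(1337)
vocab, d_model = 1024, 64                              # illustrative sizes
embed = 0.02 * rng.standard_normal((vocab, d_model))   # stand-in embedding table
tokens = rng.integers(0, vocab, size=(8, 256))         # random calibration tokens
H = gptq_hessian(embed[tokens])                        # (d_model, d_model)
```

After dampening, H is symmetric positive definite, so its Cholesky factor can drive GPTQ's column-by-column quantization updates without ever touching the training set.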

Run Command

DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
GPTQ_ENABLED=1 STP_ENABLED=0 TTT_ENABLED=0 LABEL_SMOOTHING=0.0 \
XSA_LAST_N=11 EVAL_STRIDE=64 SEED=1337 \
torchrun --nproc_per_node=8 train_gpt.py

Full research journey, ablation tables, and architecture details in the README.

Generated with Claude Code

NewyorkDev and others added 2 commits March 30, 2026 03:19
3-seed mean 1.1194 BPB (std 0.0002) on 8xH100 SXM.
Key innovation: batched Newton-Schulz via torch.bmm (5% optimizer speedup).
Full GPTQ with random token calibration (compliant, no training data access).
Extensive JEPA/STP research documented — 14 ablation tests proving auxiliary losses
hurt at this scale. LZMA compression, XSA-all-11, FA3, LeakyReLU(0.5)^2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>